Abstract


Finite sample bounds on the estimation error of the mean by the empirical
mean, uniform over a class of functions, can often be conveniently
obtained in terms of Rademacher or Gaussian averages of the class. If
a function of n variables has suitably bounded partial derivatives, it can
be substituted for the empirical mean, with uniform estimation again
controlled by Gaussian averages. Up to a constant the result recovers
standard results for the empirical mean and more recent ones about
U-statistics, and extends to a general class of estimation problems.

1 Introduction 

Suppose we are given a class $\mathcal{F}$ of loss functions $f : \mathcal{X} \to [0,1]$,
where $\mathcal{X}$ is some space, and a vector of independent observations
$X = (X_1, \dots, X_n)$, obeying some common law of probability $\mu$. The method
of empirical risk minimization seeks some $f \in \mathcal{F}$ which minimizes the
empirical average $\Phi(f(X)) = \Phi(f(X_1), \dots, f(X_n))$, where $\Phi$ is the
arithmetic mean

$$\Phi(s_1, \dots, s_n) = \frac{1}{n} \sum_{i=1}^{n} s_i.$$

The intuitive motivation of this method is the underlying hope that one thereby
approximately minimizes the expectation $\mathbb{E}_{X'} \Phi(f(X')) = \mathbb{E} f(X_1)$
(where $X'$ is always iid to $X$). A fundamental problem in learning theory is the
justification of this hope in form of a uniform finite-sample bound of the
following type:

For every law $\mu$, every $n \in \mathbb{N}$, and every $\delta > 0$ there is a number
$B(\delta, n)$ such that

$$\Pr_X \left\{ \sup_{f \in \mathcal{F}} \left( \mathbb{E}_{X'}[\Phi(f(X'))] - \Phi(f(X)) \right) > B(\delta, n) \right\} < \delta. \tag{1}$$

The bound $B(\delta, n)$ should depend little on the confidence parameter $\delta$ and
go to zero as $n \to \infty$. This paper is motivated by the question under what
conditions such bounds can be found for other functions $\Phi$, beyond arithmetic
means, such as U-statistics or other, more general, nonlinear functions.

One method to prove bounds of the form (1) above, which has gained great
popularity over the last decade and a half, is the method of Rademacher and
Gaussian averages (Koltchinskii 2000, Bartlett and Mendelson 2002). Given a
subset $Y \subseteq \mathbb{R}^n$ one defines

$$R(Y) = \mathbb{E} \sup_{y \in Y} \sum_i \epsilon_i y_i \quad \text{and} \quad G(Y) = \mathbb{E} \sup_{y \in Y} \sum_i \gamma_i y_i,$$

where the $\epsilon_i$ are independent uniform $\{-1,1\}$-valued random variables and
the $\gamma_i$ are independent standard normal variables. The Rademacher averages
$R(Y)$ and the Gaussian averages $G(Y)$ are related by the inequalities
$R(Y) \le \sqrt{\pi/2}\, G(Y)$ and $G(Y) \le 3 \ln(n)\, R(Y)$ (see Ledoux and Talagrand
1991). These quantities come into play as follows.

The random variable to bound is
$\Psi(X) = \sup_{f \in \mathcal{F}} \left( \mathbb{E}[\Phi(f(X'))] - \Phi(f(X)) \right)$.

We write

$$\Psi(X) = \mathbb{E}_{X'} \Psi(X') + \left[ \Psi(X) - \mathbb{E}_{X'} \Psi(X') \right].$$

The second term in this decomposition is the deviation of the random variable
$\Psi(X)$ from its mean, and it can be controlled using the well known bounded
difference inequality (see McDiarmid 1998 or Boucheron et al 2013, Theorem 2
below). The crucial property of the arithmetic mean is that it changes little
(here at most $1/n$) if only one of its arguments is modified. The bounded
difference inequality then shows that the second term exceeds
$\sqrt{\ln(1/\delta)/(2n)}$ with probability at most $\delta$. For the first term a
straightforward symmetrization argument gives the bound

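$$\mathbb{E}_X \Psi(X) = \mathbb{E}_X \sup_{f \in \mathcal{F}} \left( \mathbb{E}[\Phi(f(X'))] - \Phi(f(X)) \right) \le \frac{2}{n}\, \mathbb{E}_X [R(\mathcal{F}(X))],$$

where $\mathcal{F}(X) = \{ f(X) = (f(X_1), \dots, f(X_n)) : f \in \mathcal{F} \}$ is a random
subset of $\mathbb{R}^n$. Since typically $R(\mathcal{F}(X))$ is of order $\sqrt{n}$ this term
is also of order $1/\sqrt{n}$. Putting the two bounds together gives (1) with

$$B(\delta, n) = \frac{2}{n}\, \mathbb{E}_X R(\mathcal{F}(X)) + \sqrt{\frac{\ln(1/\delta)}{2n}}.$$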

Replacing the Rademacher average with the Gaussian average incurs only a
factor of $\sqrt{\pi/2}$. Both complexity measures have been very successful, because
they are often very easy to bound in practice.
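The two complexity measures and the resulting bound for the arithmetic mean can be sketched numerically. The block below is only an illustration under stated assumptions: the finite set `Y` is a hypothetical stand-in for $\mathcal{F}(X)$ on one fixed sample (so the outer expectation over $X$ is not taken), the symmetrization constant is assumed to be the standard $2/n$, and the function names are not from the paper. The Rademacher average is computed exactly by enumerating sign vectors, the Gaussian average by seeded Monte Carlo.

```python
import itertools
import math
import random

def rademacher_average(Y):
    """Exact R(Y) = E sup_{y in Y} sum_i eps_i * y_i, enumerating all sign vectors."""
    n = len(Y[0])
    total = 0.0
    for eps in itertools.product((-1, 1), repeat=n):
        total += max(sum(e * yi for e, yi in zip(eps, y)) for y in Y)
    return total / 2 ** n

def gaussian_average(Y, draws=20000, seed=0):
    """Monte Carlo estimate of G(Y) = E sup_{y in Y} sum_i gamma_i * y_i."""
    rng = random.Random(seed)
    n = len(Y[0])
    total = 0.0
    for _ in range(draws):
        g = [rng.gauss(0.0, 1.0) for _ in range(n)]
        total += max(sum(gi * yi for gi, yi in zip(g, y)) for y in Y)
    return total / draws

def mean_bound(rad_avg, n, delta):
    """B(delta, n) = (2/n) * R + sqrt(ln(1/delta) / (2n)), assuming the constant 2/n."""
    return 2.0 * rad_avg / n + math.sqrt(math.log(1.0 / delta) / (2.0 * n))

# Hypothetical class of three functions evaluated on n = 4 sample points.
Y = [(0.0, 0.0, 0.0, 0.0), (1.0, 1.0, 0.0, 0.0), (0.0, 1.0, 1.0, 1.0)]
R = rademacher_average(Y)
G = gaussian_average(Y)
B = mean_bound(R, n=4, delta=0.05)
```

With such a small `n` the bound is of course vacuous; the point is only to show how the two terms are assembled and that `R` stays below $\sqrt{\pi/2}$ times `G` up to Monte Carlo error.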

What properties of a general function $\Phi$ could guarantee similar results?
Clearly the same decomposition as above is always possible, and the bounded
difference inequality just requires that $\Phi$ changes only in the order of $1/n$
if one of its arguments is modified. This concentration property seems to be a
very common-sense postulate, which we may retain as a requirement for $\Phi$.

The difficulty still lies in the first term, because the usual symmetrization
argument relies heavily on the linearity of the arithmetic mean. This suggests
that we should get reasonable results if $\Phi$ is 'nearly' linear, in some sense
of small curvature. The crucial requirement is that the change of $\Phi$, as one
argument is changed, does not depend too strongly on the other arguments. We
will formulate this requirement in terms of mixed partial derivatives, which in
(5) will give us the bound

$$B(\delta, n) = c (L + M)\, \mathbb{E}_X G(\mathcal{F}(X)) + L \sqrt{n \ln(1/\delta)/2},$$

where $c$ is a (unfortunately rather large) universal constant. Here the bounded
difference condition and our constraints on the mixed partial derivatives of
$\Phi$ are expressed in the quantities $L$ and $M$ respectively. For the arithmetic
mean $L = 1/n$ and $M = 0$, so the price we pay for the generality of $\Phi$ is the
large constant and the presence of Gaussian instead of the Rademacher average.
This price is due to the use of Talagrand's majorizing measure theorem, a
powerful result, which was the only working vehicle the author could find for
the proof.

The first nontrivial cases are furnished by U-statistics, and we will see that
in this case $M$ and $L$ are of order $1/n$, so that we obtain bounds of the same
order as for the mean. It must at once be admitted that for U-statistics such a
result, with small constant and Rademacher instead of Gaussian averages, has
already been published by Clémençon et al (2008). Their method uses a trick
introduced by Hoeffding (1963), which reduces U-statistics to linear functions.
However, Hoeffding's method uses permutation arguments and works only if
the variables $X_i$ are identically distributed, while for our method they only
need to be independent. Besides this, U-statistics possess a certain rigidity,
while our result is applicable to a fairly large class of functions $\Phi$. Generic
members of this class have first partial derivatives uniformly bounded in order
of $1/n$ and mixed partial derivatives uniformly bounded in order of $1/n^2$. These
properties ensure that $L$ and $M$ are of order $1/n$.

The next section introduces some necessary notation, states our main result 
and sketches some applications. The last section is devoted to the proof of our 
main result. 


2 Main results 

Before stating our result we introduce some notation: the letter $\mathcal{X}$ always
denotes some arbitrary set. If $F$ is a function on $\mathcal{X}^n$ of $n$ variables, and
$x = (x_1, \dots, x_n) \in \mathcal{X}^n$ we use $F_k(x, y)$ to denote $F(x')$ where
$x'_i = x_i$ for $i \ne k$ and $x'_k = y$. We use $e_1, \dots, e_n$ to denote the
canonical basis of $\mathbb{R}^n$. If $F$ is a twice differentiable function of several
real variables then $\partial_k F$ is the partial derivative of $F$ w.r.t. the $k$-th
variable, and $\partial_{lk} F$ is the second partial derivative w.r.t. the $k$-th and
$l$-th variable. For functions $f : \mathcal{X} \to \mathbb{R}$ we write
$\|f\|_\infty = \sup_{x \in \mathcal{X}} |f(x)|$. The letter $c$ will always denote a universal
constant, which is allowed to be modified within proofs from line to line in the
standard way, so that, for example, $3c$ in one line can become $c$ in the next
line. If $X$ is any random vector, $X'$ will always be iid to $X$, which of course
does not mean that the components of $X$ are iid.


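Theorem 1 Let $X = (X_1, \dots, X_n)$ be a vector of independent random variables
with values in $\mathcal{X}$, $X'$ iid to $X$, and let $\mathcal{F}$ be a finite class of functions
$f : \mathcal{X} \to [0,1]$. Assume $\Phi : \mathbb{R}^n \to \mathbb{R}$ to be twice differentiable,
satisfying the conditions

$$\forall k, \quad \|\partial_k \Phi\|_\infty \le L \tag{2}$$

and

$$\sqrt{\sum_k \Big\| \sum_{l : l \ne k} (\partial_{lk} \Phi)^2 \Big\|_\infty} \le M. \tag{3}$$

Then

$$\mathbb{E} \sup_{f \in \mathcal{F}} \left[ \mathbb{E} \Phi(f(X')) - \Phi(f(X)) \right] \le c (M + L)\, \mathbb{E}\, G(\mathcal{F}(X)). \tag{4}$$

Furthermore, if $\delta > 0$ then with probability at least $1 - \delta$ in $X$ it holds
for all $f \in \mathcal{F}$ that

$$\mathbb{E}[\Phi(f(X'))] \le \Phi(f(X)) + c (L + M)\, \mathbb{E}_X G(\mathcal{F}(X)) + L \sqrt{n \ln(1/\delta)/2}. \tag{5}$$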

Remarks: 

1. Clearly condition (3) is satisfied trivially with $M = 0$ for linear $\Phi$. In
general, to have bounds of order $1/\sqrt{n}$ we want both $M$ and $L$ to be of order
$1/n$. This is guaranteed if the first partial derivatives are of order $1/n$, and
the mixed second partial derivatives are of order $1/n^2$.

2. Condition (2) is what we need for the application of the bounded difference
inequality, and it will give us the last term in the generalization bound (5).

3. The condition (3) is always satisfied if

$$\sqrt{\sum_{k,l : k \ne l} \|\partial_{lk} \Phi\|_\infty^2} \le M,$$

which is easier to verify. It may be that with a more careful analysis the
condition (3) can be further relaxed to

$$\Big\| \sqrt{\sum_{k,l : k \ne l} (\partial_{lk} \Phi)^2} \Big\|_\infty \le M.$$

4. It is evident from the proof that the differentiability assumption can be
removed, if condition (2) is replaced by the requirement that $\Phi$ be
$L$-Lipschitz in each coordinate separately, and condition (3) takes the form of
a second order Lipschitz condition. The statement of the latter condition
however appears somewhat cumbersome, so that here twice differentiability has
been assumed for greater clarity.

5. Other candidates for conditions on $\Phi$ come to mind, which would allow
similar results. A simple one is the requirement that $\Phi$ be a Lipschitz
function with respect to the euclidean distance on $\mathbb{R}^n$. Unfortunately the
Lipschitz constant of the arithmetic mean is already $1/\sqrt{n}$, so with
Rademacher or Gaussian averages being of order $\sqrt{n}$ no useful bounds result,
not even in the simplest case.


We conclude this section with some simple examples. First consider the
sample variance given on $[0,1]^n$ by

$$\Phi(s) = \frac{1}{n(n-1)} \sum_{i<j} (s_i - s_j)^2.$$

Then

$$\partial_k \Phi(s) = \frac{2}{n(n-1)} \sum_{i : i \ne k} (s_k - s_i)$$

and for $l \ne k$

$$\partial_{lk} \Phi(s) = \frac{-2}{n(n-1)},$$

from which we obtain $L = 2/n$ and $M = 2/\sqrt{n(n-1)} \le 2/(n-1)$. The sample
variance is a second order U-statistic with kernel $\kappa(s, s') = (s - s')^2/2$.

Now consider the general U-statistic of $m$-th order

$$\Phi(s) = \binom{n}{m}^{-1} \sum_{i_1 < \dots < i_m} \kappa(s_{i_1}, \dots, s_{i_m}),$$

where $\kappa : [0,1]^m \to \mathbb{R}$ is a symmetric, twice differentiable kernel of $m$
variables. Then for $k \in \{1, \dots, n\}$

$$|\partial_k \Phi(s)| \le \binom{n}{m}^{-1} \sum_{i_1 < \dots < i_m :\, k \in \{i_1, \dots, i_m\}} |\partial_k \kappa(s_{i_1}, \dots, s_{i_m})| \le \frac{m}{n} \|\partial_1 \kappa\|_\infty,$$

and similarly for $l \ne k$

$$|\partial_{lk} \Phi(s)| \le \frac{m(m-1)}{n(n-1)} \|\partial_{12} \kappa\|_\infty,$$

so that $L$ and $M$ are again of order $1/n$.
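The constants quoted for these examples can be checked numerically. The following is a small sketch, not part of the original argument: it differentiates the sample variance by central differences (exact for a quadratic up to rounding), evaluates $L$ at a corner of $[0,1]^n$ where the first partial derivative is largest, and evaluates $M$ under the assumption that condition (3) reads $\sqrt{\sum_k \| \sum_{l \ne k} (\partial_{lk}\Phi)^2 \|_\infty} \le M$. It also checks the counting ratios $m/n$ and $m(m-1)/(n(n-1))$ used for the general U-statistic. All function names are illustrative.

```python
import math

def sample_variance(s):
    """Phi(s) = 1/(n(n-1)) * sum_{i<j} (s_i - s_j)^2."""
    n = len(s)
    pairs = ((s[i] - s[j]) ** 2 for i in range(n) for j in range(i + 1, n))
    return sum(pairs) / (n * (n - 1))

def first_partial(phi, s, k, h=1e-4):
    """Central difference approximation of d_k phi at s."""
    up, dn = list(s), list(s)
    up[k] += h
    dn[k] -= h
    return (phi(up) - phi(dn)) / (2 * h)

def mixed_partial(phi, s, k, l, h=1e-3):
    """Central difference approximation of d_lk phi at s, for l != k."""
    def at(dk, dl):
        t = list(s)
        t[k] += dk
        t[l] += dl
        return phi(t)
    return (at(h, h) - at(h, -h) - at(-h, h) + at(-h, -h)) / (4 * h * h)

n = 5
corner = [1.0] + [0.0] * (n - 1)          # maximises |d_0 Phi| on [0,1]^n
L_est = abs(first_partial(sample_variance, corner, 0))

point = [0.3, 0.7, 0.1, 0.9, 0.5]         # arbitrary: the Hessian is constant
M_est = math.sqrt(sum(mixed_partial(sample_variance, point, k, l) ** 2
                      for k in range(n) for l in range(n) if l != k))

def fraction_containing(n, m, fixed):
    """Fraction of m-subsets of {1..n} that contain `fixed` given indices."""
    return math.comb(n - fixed, m - fixed) / math.comb(n, m)
```

For `n = 5` this reproduces $L = 2/n = 0.4$ and $M = 2/\sqrt{n(n-1)} \approx 0.447$.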

An example which is not a U-statistic and of practical relevance to learning
theory is constructed as follows. Let $\mu_1, \dots, \mu_K$ be distributions on
$\mathcal{X}$ representing different classes of objects. From each of the $\mu_k$ we
draw an iid sample and let $X$ be the concatenation of these samples, where $X$
has $n$ elements. Observe that $X_i$ and $X_j$ are in general not identically
distributed. For $i, j \in \{1, \dots, n\}$ define $r_{ij} = 1$ if $X_i$ and $X_j$ are
drawn from the same distribution and $r_{ij} = -1$ if $X_i$ and $X_j$ are drawn from
different distributions. Let $\mathcal{F}$ consist of functions $f : \mathcal{X} \to [0,1]$. We
seek a function $f \in \mathcal{F}$ which balances inter-class separation against
intra-class proximity. An obvious candidate is the functional
$\mathbb{E} \Phi(f(X))$ with

$$\Phi(s) = \frac{1}{n(n-1)} \sum_{i<j} r_{ij} (s_i - s_j)^2.$$

Except for the $r_{ij}$ this resembles the sample variance above, and it is
immediate that we obtain the same bounds for $M$ and $L$. On the other hand $\Phi$
is not permutation-symmetric nor are the $X_i$ identically distributed.
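A minimal sketch of this functional, assuming the normalization $1/(n(n-1))$ and the weights $r_{ij} = \pm 1$ described above; the class labels and the values of $s$ are made up for illustration. It shows that exchanging two coordinates that belong to different classes changes the value, so the function is not permutation-symmetric.

```python
def separation_functional(s, labels):
    """1/(n(n-1)) * sum_{i<j} r_ij * (s_i - s_j)^2 with r_ij = +1 within a class, -1 across."""
    n = len(s)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            r = 1.0 if labels[i] == labels[j] else -1.0
            total += r * (s[i] - s[j]) ** 2
    return total / (n * (n - 1))

labels = [0, 0, 1, 1]                                              # hypothetical class labels
separated = separation_functional([0.1, 0.2, 0.8, 0.9], labels)    # classes far apart
swapped = separation_functional([0.1, 0.8, 0.2, 0.9], labels)      # coordinates 1 and 2 exchanged
```

The well separated configuration gives the smaller value, which is why minimizing the functional rewards inter-class separation and intra-class proximity.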


3 The proof 

We need two important auxiliary results. The first is the well known bounded 
difference inequality, which goes back to Hoeffding (1963) (see also McDiarmid 
1998 and Boucheron et al 2013). Please recall the notation introduced at the 
beginning of the previous section. 

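Theorem 2 Suppose $F : \mathcal{X}^n \to \mathbb{R}$ and $X = (X_1, \dots, X_n)$ is a vector of
independent random variables with values in $\mathcal{X}$, $X'$ is iid to $X$. Then

$$\Pr \{ F(X) - \mathbb{E} F(X') > t \} \le \exp \left( \frac{-2 t^2}{\sup_{x \in \mathcal{X}^n} \sigma^2(x)} \right),$$

where

$$\sigma^2(x) = \sum_{k=1}^{n} \sup_{y, z \in \mathcal{X}} \left( F_k(x, y) - F_k(x, z) \right)^2.$$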
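As a sanity check of the bounded difference inequality, the sketch below compares a simulated tail frequency with the bound for the arithmetic mean of $n$ independent uniform variables on $[0,1]$. It assumes the standard form of the exponent, $-2t^2/\sup_x \sigma^2(x)$, and uses that changing one argument moves the mean by at most $1/n$, so $\sigma^2 \le n \cdot (1/n)^2$. The sample size, threshold and seed are arbitrary choices.

```python
import math
import random

def tail_frequency(n, t, trials=20000, seed=1):
    """Fraction of trials in which the mean of n uniforms exceeds its expectation by more than t."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        mean = sum(rng.random() for _ in range(n)) / n
        if mean - 0.5 > t:
            hits += 1
    return hits / trials

def bounded_difference_bound(t, sigma_sq):
    """exp(-2 t^2 / sigma^2), the assumed form of the bounded difference bound."""
    return math.exp(-2.0 * t * t / sigma_sq)

n, t = 20, 0.1
sigma_sq = n * (1.0 / n) ** 2        # n coordinates, each with range at most 1/n
empirical = tail_frequency(n, t)
bound = bounded_difference_bound(t, sigma_sq)
```

The simulated frequency lies well below the bound, as expected for an inequality that holds for every distribution on $[0,1]$.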


The second auxiliary result is due to Michel Talagrand (see Theorem 15 in 
Talagrand 1987 or Theorem 2.1.5 in Talagrand 2005). It is a consequence of the 
celebrated majorizing measure theorem (see e.g. Talagrand 1992). The version 
we state is proved in (Maurer 2014), adapted to zero mean processes and K = 1. 


Theorem 3 Let $X_t$ be a random process with zero mean, indexed by a finite
set $T \subset \mathbb{R}^n$. Suppose that for any distinct members $t, t' \in T$ and any
$s > 0$

$$\Pr \{ X_t - X_{t'} > s \} \le \exp \left( \frac{-s^2}{2 \|t - t'\|^2} \right). \tag{6}$$

Then

$$\mathbb{E} \sup_{t \in T} X_t \le c\, G(T),$$

where $c$ is a universal constant.

The constant $c$ which results from the proof is of course very large (in the
hundreds). Nevertheless, as remarked in (Talagrand 1987), if $X$ is a Gaussian
process, then Theorem 3 reduces to Slepian's Lemma (Boucheron et al 2013),
which inspires the tantalizing conjecture that the optimal $c$ could be in the
order of unity, or even equal to one.


We are now prepared for the proof of Theorem 1.

Proof of Theorem 1. We first prove (4), the proof of the generalization
bound (5) then being an easy application of the bounded difference inequality.

Let $Q$ be the left hand side of (4). Initially our proof parallels the standard
symmetrization argument: we pull the second expectation outside the supremum

$$Q \le \mathbb{E}_{XX'} \sup_{f \in \mathcal{F}} \left[ \Phi \Big( \sum_i f(X_i)\, e_i \Big) - \Phi \Big( \sum_i f(X'_i)\, e_i \Big) \right].$$


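Since $X_i$ and $X'_i$ are iid, the last quantity does not change if we exchange
$X_i$ and $X'_i$ on an arbitrary subset of indices $i$. If $\sigma \in \{0,1\}^n$ is such
that $\sigma_i$ is zero on this set and one on its complement, we obtain

$$\begin{aligned}
Q &\le \mathbb{E}_{XX'} \sup_{f \in \mathcal{F}} \left[ \Phi \Big( \sum_i [\sigma_i f(X_i) + (1 - \sigma_i) f(X'_i)]\, e_i \Big) - \Phi \Big( \sum_i [\sigma_i f(X'_i) + (1 - \sigma_i) f(X_i)]\, e_i \Big) \right] \\
&= \mathbb{E}_{XX'} \mathbb{E}_\sigma \sup_{f \in \mathcal{F}} \left[ \Phi \Big( \sum_i [\sigma_i f(X_i) + (1 - \sigma_i) f(X'_i)]\, e_i \Big) - \Phi \Big( \sum_i [\sigma_i f(X'_i) + (1 - \sigma_i) f(X_i)]\, e_i \Big) \right].
\end{aligned}$$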


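In the last step we took the expectation over configurations $\sigma$ chosen
uniformly from $\{0,1\}^n$. We now condition on the $X_i$ and $X'_i$ (which we
temporarily replace by lower case letters) and consider the random process

$$Y_f(\sigma) = \Phi \Big( \sum_i [\sigma_i f(x_i) + (1 - \sigma_i) f(x'_i)]\, e_i \Big) - \Phi \Big( \sum_i [\sigma_i f(x'_i) + (1 - \sigma_i) f(x_i)]\, e_i \Big).$$

Clearly $\mathbb{E}_\sigma Y_f(\sigma) = 0$ for all $f \in \mathcal{F}$.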
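The zero-mean property follows because replacing $\sigma$ by its complement exchanges the two arguments of the difference and therefore flips the sign of $Y_f$. A short sketch that checks this by enumerating all configurations, using the sample variance as a stand-in for $\Phi$ and made-up values for $f(x_i)$ and $f(x'_i)$:

```python
import itertools

def sample_variance(s):
    """Stand-in for Phi: 1/(n(n-1)) * sum_{i<j} (s_i - s_j)^2."""
    n = len(s)
    pairs = ((s[i] - s[j]) ** 2 for i in range(n) for j in range(i + 1, n))
    return sum(pairs) / (n * (n - 1))

def Y(sigma, fx, fxp, phi=sample_variance):
    """Y_f(sigma) for the value vectors fx = f(x) and fxp = f(x')."""
    a = [si * u + (1 - si) * v for si, u, v in zip(sigma, fx, fxp)]
    b = [si * v + (1 - si) * u for si, u, v in zip(sigma, fx, fxp)]
    return phi(a) - phi(b)

fx = [0.2, 0.9, 0.4, 0.6]     # hypothetical values f(x_1), ..., f(x_4)
fxp = [0.7, 0.1, 0.5, 0.3]    # hypothetical values f(x'_1), ..., f(x'_4)

configs = list(itertools.product((0, 1), repeat=len(fx)))
mean_Y = sum(Y(sigma, fx, fxp) for sigma in configs) / len(configs)

# Complementing sigma flips the sign of Y_f.
sigma = (1, 0, 0, 1)
complement = tuple(1 - si for si in sigma)
antisymmetric = Y(sigma, fx, fxp) == -Y(complement, fx, fxp)
```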

Now we want to apply Theorem 3. To this end we define a (pseudo-) metric
on $\mathcal{F}$ by

$$d(f, g) = \left( \sum_{i=1}^{n} (f(x_i) - g(x_i))^2 + (f(x'_i) - g(x'_i))^2 \right)^{1/2}, \quad f, g \in \mathcal{F},$$

and seek to prove, for fixed $f, g \in \mathcal{F}$ and $s > 0$ the inequality

$$\Pr \{ Y_f - Y_g > s \} \le \exp \left( \frac{-s^2}{8 (M^2 + L^2)\, d(f, g)^2} \right). \tag{7}$$


Let $Z(\sigma) = Y_f(\sigma) - Y_g(\sigma)$. To prove (7) we will apply the bounded
difference inequality, Theorem 2, to $Z$. Fix a configuration $\sigma \in \{0,1\}^n$.
We define the vectors $A, B, C, D \in [0,1]^n$ by

$$\begin{aligned}
A &= \sum_i (\sigma_i f(x_i) + (1 - \sigma_i) f(x'_i))\, e_i, \\
B &= \sum_i (\sigma_i g(x_i) + (1 - \sigma_i) g(x'_i))\, e_i, \\
C &= \sum_i (\sigma_i f(x'_i) + (1 - \sigma_i) f(x_i))\, e_i, \\
D &= \sum_i (\sigma_i g(x'_i) + (1 - \sigma_i) g(x_i))\, e_i.
\end{aligned}$$


Then for any $k \in \{1, \dots, n\}$

$$\begin{aligned}
Z_k(\sigma, 1) - Z_k(\sigma, 0)
&= \Phi_k(A, f(x_k)) - \Phi_k(B, g(x_k)) + \Phi_k(D, g(x'_k)) - \Phi_k(C, f(x'_k)) \\
&\quad - \Phi_k(A, f(x'_k)) + \Phi_k(B, g(x'_k)) - \Phi_k(D, g(x_k)) + \Phi_k(C, f(x_k)).
\end{aligned}$$

Adding and subtracting the quantities $\Phi_k(B, f(x_k))$, $\Phi_k(B, f(x'_k))$,
$\Phi_k(C, g(x_k))$ and $\Phi_k(C, g(x'_k))$, rearranging terms, and using Jensen's
inequality (which is responsible for the factor 1/8) we get

$$\begin{aligned}
\tfrac{1}{8} \left( Z_k(\sigma, 1) - Z_k(\sigma, 0) \right)^2
&\le [\Phi_k(B, f(x_k)) - \Phi_k(B, g(x_k))]^2 + [\Phi_k(B, g(x'_k)) - \Phi_k(B, f(x'_k))]^2 \\
&\quad + [\Phi_k(C, f(x_k)) - \Phi_k(C, g(x_k))]^2 + [\Phi_k(C, g(x'_k)) - \Phi_k(C, f(x'_k))]^2 \\
&\quad + [\Phi_k(A, f(x_k)) - \Phi_k(A, f(x'_k)) - (\Phi_k(B, f(x_k)) - \Phi_k(B, f(x'_k)))]^2 \\
&\quad + [\Phi_k(D, g(x'_k)) - \Phi_k(D, g(x_k)) - (\Phi_k(C, g(x'_k)) - \Phi_k(C, g(x_k)))]^2.
\end{aligned} \tag{8}$$


The first four terms are controlled with the coordinatewise Lipschitz condition
(2), and their sum is bounded by

$$2 L^2 \left[ (f(x_k) - g(x_k))^2 + (f(x'_k) - g(x'_k))^2 \right]. \tag{9}$$


The last two terms are bounded using the condition (3) on the mixed partials.
Consider the term

$$T := [\Phi_k(A, f(x_k)) - \Phi_k(A, f(x'_k))] - [\Phi_k(B, f(x_k)) - \Phi_k(B, f(x'_k))].$$

Define a function $F : [0,1]^2 \to \mathbb{R}$ by

$$F(t, s) = \Phi_k \left( tA + (1 - t) B,\; s f(x_k) + (1 - s) f(x'_k) \right).$$


Then

$$T = [F(1,1) - F(1,0)] - [F(0,1) - F(0,0)] = \int_0^1 \int_0^1 \partial_{12} F(t, s)\, ds\, dt,$$

so that $T^2 \le \sup_{s,t \in [0,1]} [\partial_{12} F(t, s)]^2$. Now

$$\partial_{12} F(t, s) = \sum_{l : l \ne k} (\partial_{lk} \Phi)_k \left( tA + (1 - t) B,\; s f(x_k) + (1 - s) f(x'_k) \right) \times (f(x_k) - f(x'_k)) (A_l - B_l),$$

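and, using $|f(x_k) - f(x'_k)| \le 1$, Cauchy Schwarz, and the definitions of $A$
and $B$,

$$\begin{aligned}
\sup_{s,t \in [0,1]} [\partial_{12} F(t, s)]^2
&\le \Big\| \sum_{l : l \ne k} (\partial_{lk} \Phi) (A_l - B_l) \Big\|_\infty^2 \\
&\le \Big\| \sum_{l : l \ne k} (\partial_{lk} \Phi)^2 \Big\|_\infty \sum_{l : l \ne k} (A_l - B_l)^2 \\
&\le \Big\| \sum_{l : l \ne k} (\partial_{lk} \Phi)^2 \Big\|_\infty d(f, g)^2.
\end{aligned}$$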


The last term in (8) is bounded in exactly the same way. Summing these bounds
and the bound in (9) over $k$ we get

$$\sum_k \left( Z_k(\sigma, 1) - Z_k(\sigma, 0) \right)^2 \le 16 (M^2 + L^2)\, d(f, g)^2.$$
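The increment bound can be checked by brute force on a small instance. The sketch below is an illustration only: it takes the sample variance for $\Phi$ with the constants $L = 2/n$ and $M = 2/\sqrt{n(n-1)}$ from the examples, draws hypothetical values of $f$ and $g$ on $x$ and $x'$ at random, and returns the largest ratio of $\sum_k (Z_k(\sigma,1) - Z_k(\sigma,0))^2$ to $16(M^2 + L^2)\, d(f,g)^2$ over all configurations $\sigma$.

```python
import itertools
import math
import random

def sample_variance(s):
    """Stand-in for Phi: 1/(n(n-1)) * sum_{i<j} (s_i - s_j)^2."""
    n = len(s)
    pairs = ((s[i] - s[j]) ** 2 for i in range(n) for j in range(i + 1, n))
    return sum(pairs) / (n * (n - 1))

def mix(sigma, u, v):
    """Vector with coordinates sigma_i * u_i + (1 - sigma_i) * v_i."""
    return [si * a + (1 - si) * b for si, a, b in zip(sigma, u, v)]

def Y(sigma, fx, fxp):
    """Y_f(sigma) = Phi(mix(sigma)) - Phi(mix(complement of sigma))."""
    flipped = [1 - si for si in sigma]
    return sample_variance(mix(sigma, fx, fxp)) - sample_variance(mix(flipped, fx, fxp))

def Z(sigma, fx, fxp, gx, gxp):
    return Y(sigma, fx, fxp) - Y(sigma, gx, gxp)

def worst_ratio(n, seed=0):
    """Largest observed ratio of the summed squared increments to the claimed bound."""
    rng = random.Random(seed)
    fx, fxp, gx, gxp = ([rng.random() for _ in range(n)] for _ in range(4))
    L = 2.0 / n
    M = 2.0 / math.sqrt(n * (n - 1))
    d_sq = (sum((a - b) ** 2 for a, b in zip(fx, gx))
            + sum((a - b) ** 2 for a, b in zip(fxp, gxp)))
    rhs = 16.0 * (M * M + L * L) * d_sq
    worst = 0.0
    for sigma in itertools.product((0, 1), repeat=n):
        total = 0.0
        for k in range(n):
            one = sigma[:k] + (1,) + sigma[k + 1:]
            zero = sigma[:k] + (0,) + sigma[k + 1:]
            total += (Z(one, fx, fxp, gx, gxp) - Z(zero, fx, fxp, gx, gxp)) ** 2
        worst = max(worst, total / rhs)
    return worst

ratio = worst_ratio(5)
```

A ratio below one is consistent with the inequality; how far below gives a rough impression of the slack introduced by the constant 16 in this particular instance.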


The bounded difference inequality then gives us

$$\Pr \{ Z > s \} \le \exp \left( \frac{-s^2}{8 (M^2 + L^2)\, d(f, g)^2} \right),$$

which proves the desired (7).

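Now let $H_f$ be the process defined by $H_f = Y_f / \sqrt{4 (M^2 + L^2)}$. Then

$$\Pr \{ H_f - H_g > s \} \le \exp \left( \frac{-s^2}{2\, d(f, g)^2} \right).$$

Since $d$ is exactly the euclidean metric on $\mathcal{F}(x, x') \subset \mathbb{R}^{2n}$ we can
apply Theorem 3 to $H_f$ and conclude that

$$\mathbb{E} \sup_f Y_f = \sqrt{4 (M^2 + L^2)}\; \mathbb{E} \sup_f \left( H_f - H_{f_0} \right) \le c \sqrt{M^2 + L^2}\; \mathbb{E} \sup_f \sum_i \left( \gamma_i f(x_i) + \gamma'_i f(x'_i) \right),$$

where $f_0 \in \mathcal{F}$ is an arbitrary fixed function (subtracting $H_{f_0}$ does not
change the expectation, because the process has zero mean) and the $\gamma_i$ and
$\gamma'_i$ are independent standard normal variables.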

f i 

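We now remove the conditioning and return to the $X_i$-variables, to get

$$\begin{aligned}
Q \le \mathbb{E}_{XX'} \mathbb{E}_\sigma \sup_f Y_f
&\le c \sqrt{M^2 + L^2}\; \mathbb{E}_{XX'} \mathbb{E}_{\gamma\gamma'} \sup_f \sum_i \left( \gamma_i f(X_i) + \gamma'_i f(X'_i) \right) \\
&\le c \sqrt{M^2 + L^2}\; \mathbb{E} \sup_f \sum_i \gamma_i f(X_i).
\end{aligned}$$

This completes the proof of the first part of the theorem, inequality (4), because
$\sqrt{M^2 + L^2} \le M + L$.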

For the second assertion let
$\Psi(X) = \sup_{f \in \mathcal{F}} \left( \mathbb{E}[\Phi(f(X'))] - \Phi(f(X)) \right)$ and
write, just as in the introduction,

$$\Psi(X) = \mathbb{E}[\Psi(X')] + \left( \Psi(X) - \mathbb{E}[\Psi(X')] \right). \tag{11}$$

The first term has already been bounded in (4). For the second term observe
that, since the functions in $\mathcal{F}$ have range in $[0,1]$, $\Psi(X)$ changes at most
by $L$ if any of its arguments is modified. The bounded difference inequality gives

$$\Pr \{ \Psi(X) - \mathbb{E}[\Psi(X')] > t \} \le \exp \left( \frac{-2 t^2}{n L^2} \right).$$

Equating to $\delta$ and solving for $t$ gives with probability at least $1 - \delta$ that

$$\Psi(X) - \mathbb{E}[\Psi(X')] \le L \sqrt{\frac{n \ln(1/\delta)}{2}}.$$
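The step of equating the tail bound to $\delta$ and solving for $t$ can be verified directly. The helper below is a sketch with an illustrative name; it returns $t = L\sqrt{n \ln(1/\delta)/2}$, and substituting it back into $\exp(-2t^2/(nL^2))$ recovers $\delta$. For the arithmetic mean, where $L = 1/n$, it reduces to $\sqrt{\ln(1/\delta)/(2n)}$.

```python
import math

def deviation_term(L, n, delta):
    """Solve exp(-2 t^2 / (n L^2)) = delta for t, giving L * sqrt(n ln(1/delta) / 2)."""
    return L * math.sqrt(n * math.log(1.0 / delta) / 2.0)

n, delta = 50, 0.05
L = 0.1                                   # arbitrary coordinatewise Lipschitz constant
t = deviation_term(L, n, delta)
recovered_delta = math.exp(-2.0 * t * t / (n * L * L))

# Arithmetic mean: L = 1/n gives the familiar sqrt(ln(1/delta) / (2n)).
t_mean = deviation_term(1.0 / n, n, delta)
```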

Together with the decomposition (11) and the bound on $\mathbb{E}[\Psi(X)]$ implied by
(4) this completes the proof of the generalization bound (5). ■